Global Socioeconomic Data Analysis: GDP & Population
Web Scraping, Cleaning, Analysis & Visualization
Introduction
This project analyzes global socioeconomic indicators, focusing on GDP and population trends by country and continent. The goal is to explore relationships between population size, economic performance (GDP), and geographic distribution.
Data was scraped from Wikipedia, cleaned, merged, and visualized using Python.
Technologies Used:
Python (pandas, BeautifulSoup, requests)
Data Visualization (Matplotlib, Seaborn, Plotly)
Data Wrangling (pandas)
Objectives
Scrape GDP & Population datasets from Wikipedia.
Clean and align the datasets for analysis.
Visualize population and GDP trends across continents and countries.
Identify socioeconomic disparities.
Demonstrate end-to-end data workflow.
Data Sources
| Dataset | Source | URL |
|---|---|---|
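| GDP by Country | Wikipedia: GDP (Nominal) | https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal) |
| Population by Country | Wikipedia: UN Population Estimate | https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) |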
Web Scraping Process
BeautifulSoup is a Python library used to parse and extract data from HTML or XML files.
pandas.read_html() reads every table on a webpage into a list of DataFrames. The relevant table is then selected by attributes such as class name, by matching text, or by positional index in that list.
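As a rough illustration of those selection options, the sketch below parses a small inline HTML snippet instead of a live page. The snippet, its class names, and the country names are made up for the example; `attrs` and `match` are existing `pandas.read_html` parameters. It assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page used only for illustration
html = """
<table class="infobox">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>Source</td><td>Example</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Freedonia</td><td>100</td></tr>
  <tr><td>Sylvania</td><td>250</td></tr>
</table>
"""

# Positional selection: the second table on the page
by_position = pd.read_html(StringIO(html))[1]

# Attribute selection: only tables whose class attribute is "wikitable"
by_class = pd.read_html(StringIO(html), attrs={"class": "wikitable"})[0]

# Text selection: only tables containing text that matches the pattern
by_text = pd.read_html(StringIO(html), match="Country")[0]

print(by_class)
```

Positional indexes break when the page layout changes, so selecting by attribute or by text is often the more robust choice for pages that are edited frequently.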
Import necessary libraries
import requests # Used to send HTTP requests and fetch content from websites
from bs4 import BeautifulSoup # Parses HTML/XML content to extract and navigate elements for web scraping
import pandas as pd # Handles tabular data, allows you to store scraped tables in DataFrames and manipulate them
import matplotlib.pyplot as plt # Basic plotting library for creating charts and graphs (line, bar, scatter, etc.)
import seaborn as sns # Enhances matplotlib visuals; great for advanced statistical plots and aesthetic styling
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")
Step 1: Scrape GDP Data
# Step 1: Scrape GDP data
url_gdp = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
response = requests.get(url_gdp)
response.raise_for_status()  # Stop early if the request failed
soup = BeautifulSoup(response.content, "lxml")  # Parsed page, available for manual inspection
# Parse every table on the page; the GDP table is the third one (index 2),
# identified by inspecting the page manually via browser dev tools
tables = pd.read_html(response.text)
gdp_table = tables[2]
# View the table
gdp_table.head()
| | Country/Territory | IMF[1][12] Forecast | IMF[1][12] Year | World Bank[13] Estimate | World Bank[13] Year | United Nations[14] Estimate | United Nations[14] Year |
|---|---|---|---|---|---|---|---|
| 0 | World | 113795678 | 2025 | 111326370 | 2024 | 100834796 | 2022 |
| 1 | United States | 30507217 | 2025 | 29184890 | 2024 | 27720700 | 2023 |
| 2 | China | 19231705 | [n 1]2025 | 18743803 | [n 3]2024 | 17794782 | [n 1]2023 |
| 3 | Germany | 4744804 | 2025 | 4659929 | 2024 | 4525704 | 2023 |
| 4 | India | 4187017 | 2025 | 3912686 | 2024 | 3575778 | 2023 |
Step 2: Scrape Population Data
# Scrape population data
url_population = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
response = requests.get(url_population)
response.raise_for_status()  # Stop early if the request failed
soup = BeautifulSoup(response.content, "lxml")  # Parsed page, available for manual inspection
# The population table is the first table on the page (index 0)
tables = pd.read_html(response.text)
population_table = tables[0]
population_table.head()
| | Country or territory | Population (1 July 2022) | Population (1 July 2023) | Change (%) | UN continental region[1] | UN statistical subregion[1] |
|---|---|---|---|---|---|---|
| 0 | World | 8021407192 | 8091734930 | +0.88% | – | – |
| 1 | India | 1425423212 | 1438069596 | +0.89% | Asia | Southern Asia |
| 2 | China[a] | 1425179569 | 1422584933 | −0.18% | Asia | Eastern Asia |
| 3 | United States | 341534046 | 343477335 | +0.57% | Americas | Northern America |
| 4 | Indonesia | 278830529 | 281190067 | +0.85% | Asia | South-eastern Asia |
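# Restructure the columns (flatten the two-level header into single names)
# Note: the raw values are in millions of USD, despite the "Billion" in the column name
gdp_table.columns = [
    "Country",
    "IMF_GDP_Billion_USD", "IMF_Year",
    "WorldBank_Estimate", "WorldBank_Year",
    "UN_Estimate", "UN_Year"
]
# Drop the first row (the "World" aggregate)
gdp_table = gdp_table.drop(index=0)
display(gdp_table.head())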
| | Country | IMF_GDP_Billion_USD | IMF_Year | WorldBank_Estimate | WorldBank_Year | UN_Estimate | UN_Year |
|---|---|---|---|---|---|---|---|
| 1 | United States | 30507217 | 2025 | 29184890 | 2024 | 27720700 | 2023 |
| 2 | China | 19231705 | [n 1]2025 | 18743803 | [n 3]2024 | 17794782 | [n 1]2023 |
| 3 | Germany | 4744804 | 2025 | 4659929 | 2024 | 4525704 | 2023 |
| 4 | India | 4187017 | 2025 | 3912686 | 2024 | 3575778 | 2023 |
| 5 | Japan | 4186431 | 2025 | 4026211 | 2024 | 4204495 | 2023 |
# List column names
gdp_table.columns
Index(['Country', 'IMF_GDP_Billion_USD', 'IMF_Year', 'WorldBank_Estimate',
'WorldBank_Year', 'UN_Estimate', 'UN_Year'],
dtype='object')
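# Remove footnote artifacts such as "[n 1]"
# Caution: the pattern (\d{4}) keeps only the first run of four digits in each cell.
# That cleans the year columns ("[n 1]2025" -> "2025"), but it also truncates the
# GDP estimates (30507217 -> "3050"), as the output of this cell shows.
footnote_cols = ['IMF_GDP_Billion_USD', 'IMF_Year', 'WorldBank_Estimate',
                 'WorldBank_Year', 'UN_Estimate', 'UN_Year']
for col in footnote_cols:
    gdp_table[col] = gdp_table[col].astype(str).str.extract(r"(\d{4})")
display(gdp_table.head())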
| | Country | IMF_GDP_Billion_USD | IMF_Year | WorldBank_Estimate | WorldBank_Year | UN_Estimate | UN_Year |
|---|---|---|---|---|---|---|---|
| 1 | United States | 3050 | 2025 | 2918 | 2024 | 2772 | 2023 |
| 2 | China | 1923 | 2025 | 1874 | 2024 | 1779 | 2023 |
| 3 | Germany | 4744 | 2025 | 4659 | 2024 | 4525 | 2023 |
| 4 | India | 4187 | 2025 | 3912 | 2024 | 3575 | 2023 |
| 5 | Japan | 4186 | 2025 | 4026 | 2024 | 4204 | 2023 |
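The four-digit pattern used in the loop above is what shortens 30507217 to 3050 in this output, so every later figure is built on truncated estimates. A minimal sketch of an alternative, assuming the only artifacts are bracketed markers like `[n 1]`: remove the brackets and keep everything else. The helper name `strip_footnotes` and the short column names are illustrative; the sample cells are copied from the raw table in Step 1.

```python
import pandas as pd

# Sample cells copied from the scraped GDP table shown in Step 1
sample = pd.DataFrame({
    "IMF_GDP": ["30507217", "19231705"],
    "IMF_Year": ["2025", "[n 1]2025"],
})


def strip_footnotes(series: pd.Series) -> pd.Series:
    """Remove bracketed markers such as "[n 1]" (hypothetical helper)."""
    return series.astype(str).str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()


# Apply the helper to every column, then convert to nullable integers
cleaned = sample.apply(strip_footnotes)
for col in cleaned.columns:
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce").astype("Int64")

print(cleaned)
```

With this approach the estimates keep all their digits, and cells that hold no number at all become `<NA>` through `errors="coerce"`.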
# Display the structure of the columns
gdp_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 1 to 221
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Country              221 non-null    object
 1   IMF_GDP_Billion_USD  180 non-null    object
 2   IMF_Year             189 non-null    object
 3   WorldBank_Estimate   198 non-null    object
 4   WorldBank_Year       209 non-null    object
 5   UN_Estimate          200 non-null    object
 6   UN_Year              212 non-null    object
dtypes: object(7)
memory usage: 12.2+ KB
# Define columns
numeric_cols = ["IMF_GDP_Billion_USD", "WorldBank_Estimate", "UN_Estimate"]
year_cols = ["IMF_Year", "WorldBank_Year", "UN_Year"]
# Convert numeric estimates to integers
gdp_table[numeric_cols] = gdp_table[numeric_cols].apply(pd.to_numeric, errors="coerce")
gdp_table[numeric_cols] = gdp_table[numeric_cols].astype("Int64")
# Clean and convert year columns to datetime
for col in year_cols:
    gdp_table[col] = pd.to_datetime(
        gdp_table[col].astype(str).str.extract(r"(\d{4})")[0], format="%Y"
    )
# Display cleaned dataset summary
print(gdp_table.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 1 to 221
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Country              221 non-null    object
 1   IMF_GDP_Billion_USD  180 non-null    Int64
 2   IMF_Year             189 non-null    datetime64[ns]
 3   WorldBank_Estimate   198 non-null    Int64
 4   WorldBank_Year       209 non-null    datetime64[ns]
 5   UN_Estimate          200 non-null    Int64
 6   UN_Year              212 non-null    datetime64[ns]
dtypes: Int64(3), datetime64[ns](3), object(1)
memory usage: 12.9+ KB
None
# Check NaNs
print(gdp_table.isna().sum())
Country                 0
IMF_GDP_Billion_USD    41
IMF_Year               32
WorldBank_Estimate     23
WorldBank_Year         12
UN_Estimate            21
UN_Year                 9
dtype: int64
gdp_data = gdp_table.dropna()
# Check if NaNs are removed
print(gdp_data.isna().sum())
Country                0
IMF_GDP_Billion_USD    0
IMF_Year               0
WorldBank_Estimate     0
WorldBank_Year         0
UN_Estimate            0
UN_Year                0
dtype: int64
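Calling `dropna()` with no arguments discards every row that is missing a value from any of the three sources. If the later analysis relies only on the IMF figures, a narrower filter would keep more countries. A minimal sketch on made-up rows (the country labels and values are placeholders, not scraped data):

```python
import pandas as pd

# Placeholder rows; the values are illustrative, not scraped data
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "IMF_GDP_Billion_USD": pd.array([100, 200, None], dtype="Int64"),
    "UN_Estimate": pd.array([90, None, 70], dtype="Int64"),
})

# Keeps only rows that are complete in every column
all_sources = df.dropna()

# Keeps every row that has an IMF value, whatever the other sources report
imf_only = df.dropna(subset=["IMF_GDP_Billion_USD"])

print(len(all_sources), len(imf_only))
```

Which filter is appropriate depends on whether the World Bank and UN columns are used downstream; in this notebook the charts draw on the IMF column only.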
Cleaning Population Data:
- Rename columns for clarity
- Drop the World aggregate row and check for missing values
# Restructure the columns
population_table.columns = [
    "Country", "Population_2022_Count", "Population_2023_Count",
    "Change_Percentage", "Continent", "Region"
]
# Drop the first row (the "World" aggregate)
population_data = population_table.drop(index=0)
display(population_data.head())
print(population_data.info())
| | Country | Population_2022_Count | Population_2023_Count | Change_Percentage | Continent | Region |
|---|---|---|---|---|---|---|
| 1 | India | 1425423212 | 1438069596 | +0.89% | Asia | Southern Asia |
| 2 | China[a] | 1425179569 | 1422584933 | −0.18% | Asia | Eastern Asia |
| 3 | United States | 341534046 | 343477335 | +0.57% | Americas | Northern America |
| 4 | Indonesia | 278830529 | 281190067 | +0.85% | Asia | South-eastern Asia |
| 5 | Pakistan | 243700667 | 247504495 | +1.56% | Asia | Southern Asia |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 1 to 237
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Country                237 non-null    object
 1   Population_2022_Count  237 non-null    int64
 2   Population_2023_Count  237 non-null    int64
 3   Change_Percentage      237 non-null    object
 4   Continent              237 non-null    object
 5   Region                 237 non-null    object
dtypes: int64(2), object(4)
memory usage: 11.2+ KB
None
# Check for blanks
population_data.isnull().sum()
Country                  0
Population_2022_Count    0
Population_2023_Count    0
Change_Percentage        0
Continent                0
Region                   0
dtype: int64
Step 3: Merge GDP and Population Data
# Merge datasets on Country
df_combined = pd.merge(gdp_data, population_data, on='Country', how='inner')
display(df_combined.head())
| | Country | IMF_GDP_Billion_USD | IMF_Year | WorldBank_Estimate | WorldBank_Year | UN_Estimate | UN_Year | Population_2022_Count | Population_2023_Count | Change_Percentage | Continent | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 3050 | 2025-01-01 | 2918 | 2024-01-01 | 2772 | 2023-01-01 | 341534046 | 343477335 | +0.57% | Americas | Northern America |
| 1 | Germany | 4744 | 2025-01-01 | 4659 | 2024-01-01 | 4525 | 2023-01-01 | 84086227 | 84548231 | +0.55% | Europe | Western Europe |
| 2 | India | 4187 | 2025-01-01 | 3912 | 2024-01-01 | 3575 | 2023-01-01 | 1425423212 | 1438069596 | +0.89% | Asia | Southern Asia |
| 3 | Japan | 4186 | 2025-01-01 | 4026 | 2024-01-01 | 4204 | 2023-01-01 | 124997578 | 124370947 | −0.50% | Asia | Eastern Asia |
| 4 | United Kingdom | 3839 | 2025-01-01 | 3643 | 2024-01-01 | 3380 | 2023-01-01 | 68179315 | 68682962 | +0.74% | Europe | Northern Europe |
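An inner join silently drops any country whose name differs between the two tables. China, for example, is in the GDP table but absent from the merged rows above, which is consistent with the population table labelling it `China[a]`. The sketch below, on a three-row sample taken from the raw tables, strips bracketed footnote markers from the join key and uses the `indicator` option of `pd.merge` to list rows that still fail to match. The helper name `clean_country` and the shortened column names are illustrative.

```python
import pandas as pd

# Small samples with values copied from the raw tables earlier in the notebook
gdp = pd.DataFrame({
    "Country": ["United States", "China", "Germany"],
    "GDP": [30507217, 19231705, 4744804],
})
pop = pd.DataFrame({
    "Country": ["India", "China[a]", "United States"],
    "Population": [1438069596, 1422584933, 343477335],
})


def clean_country(names: pd.Series) -> pd.Series:
    """Strip bracketed footnote markers from country names (hypothetical helper)."""
    return names.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()


pop["Country"] = clean_country(pop["Country"])

# An outer join with indicator=True records where each row came from
check = pd.merge(gdp, pop, on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
merged = check[check["_merge"] == "both"].drop(columns="_merge")

print(unmatched[["Country", "_merge"]])
```

Reviewing `unmatched` before switching back to an inner join makes it clear which countries are lost to naming differences rather than to missing data.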
# Group and sort average GDP
gdp_by_continent = (
    df_combined.groupby('Continent')['IMF_GDP_Billion_USD']
    .mean()
    .sort_values()
    .reset_index()
)
# Rename for clarity
gdp_by_continent.columns = ["Continent", "Avg_IMF_GDP_Billion_USD"]
def highlight_by_value(val):
    """Return a background color based on the average GDP value."""
    if val < 3000:
        return 'background-color: lightcoral'
    elif val < 4000:
        return 'background-color: gold'
    else:
        return 'background-color: lightgreen'
# Note: Styler.applymap is deprecated since pandas 2.1 in favor of Styler.map
styled_df = gdp_by_continent.style.applymap(highlight_by_value, subset=['Avg_IMF_GDP_Billion_USD']) \
    .format({'Avg_IMF_GDP_Billion_USD': "{:,.2f}"})
styled_df
| | Continent | Avg_IMF_GDP_Billion_USD |
|---|---|---|
| 0 | Oceania | 2,725.17 |
| 1 | Africa | 3,264.69 |
| 2 | Americas | 3,284.94 |
| 3 | Asia | 3,752.68 |
| 4 | Europe | 4,948.61 |
Step 5: Visualization
Population by Continent
# Aggregate population by continent
pop_by_continent = (
    df_combined.groupby("Continent")["Population_2023_Count"]
    .sum()
    .reset_index()
)
# Format labels with continent and formatted population
labels = [
    f"{row['Continent']} ({row['Population_2023_Count']:,})"
    for _, row in pop_by_continent.iterrows()
]
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(
    pop_by_continent["Population_2023_Count"],
    labels=labels,
    autopct="%1.1f%%",  # Shows percent
    startangle=140,
    colors=plt.cm.Set3.colors
)
ax.set_title("🌍 Population Distribution by Continent (2023)", fontsize=14)
plt.tight_layout()
plt.show()
Top 10 Countries by Population (2023)
# Sort by Population 2023
top_population = df_combined.sort_values(by='Population_2023_Count', ascending=False).head(10)
# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='Population_2023_Count', data=top_population, hue='Country',
            palette='Blues_d')
plt.title('Top 10 Countries by Population (2023)')
plt.ylabel('Population (Billions)')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Treemap showing GDP distribution by continent
import plotly.express as px
# 🌳 Create Treemap
fig = px.treemap(
    df_combined,
    path=["Continent"],
    values="IMF_GDP_Billion_USD",
    color="IMF_GDP_Billion_USD",
    color_continuous_scale="Viridis",
    title="IMF GDP Distribution by Continent (in Billions USD)"
)
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig.show()
Top 10 Countries by IMF GDP Estimate (2025)
# Sort by IMF_GDP_Billion_USD
top_gdp = df_combined.sort_values(by='IMF_GDP_Billion_USD', ascending=False).head(10)
# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='IMF_GDP_Billion_USD', data=top_gdp, palette='Greens_d')
plt.title('Top 10 Countries by IMF GDP Estimate (2025)')
plt.ylabel('GDP (Billion USD)')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Scatter Plot: Population vs GDP
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_combined, x='Population_2023_Count', y='IMF_GDP_Billion_USD', hue='Continent', s=100)
plt.title('Population (2023) vs GDP (IMF 2025 Forecast)')
plt.xlabel('Population (2023)')
plt.ylabel('GDP (Billion USD)')
plt.xscale('log')
plt.yscale('log')
plt.grid(True, which="both", ls="--", linewidth=0.5)
plt.tight_layout()
plt.show()
From the scatterplot, countries with larger populations tend to have higher GDP, but the relationship is far from proportional: population alone does not determine economic output.
Some countries have a GDP far above what their population alone would suggest (e.g., United States).
Others combine large populations with low GDP (e.g., some African nations), which highlights regional disparities.
Asian and African nations are more concentrated in lower GDP per capita ranges.
The Americas and Europe show more economic diversity even among similarly populated countries.
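One way to put a number on the "larger population, higher GDP, but not proportionally" observation is a rank correlation, which does not assume a linear relationship. The sketch below uses five rows copied from the raw tables earlier in this notebook (2023 population and the IMF 2025 figure as scraped); the shortened column names are illustrative, and five rows are far too few to draw conclusions from. On the full merged frame the same expression would be applied to its population and GDP columns.

```python
import pandas as pd

# Five rows copied from the raw GDP and population tables shown earlier
sample = pd.DataFrame({
    "Country": ["United States", "China", "Germany", "India", "Japan"],
    "Population_2023": [343477335, 1422584933, 84548231, 1438069596, 124370947],
    "IMF_GDP": [30507217, 19231705, 4744804, 4187017, 4186431],
})

# Spearman's rank correlation equals the Pearson correlation of the ranks
pop_rank = sample["Population_2023"].rank()
gdp_rank = sample["IMF_GDP"].rank()
rho = pop_rank.corr(gdp_rank)

print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 would mean population rank almost fully predicts GDP rank; values closer to 0 support the reading that population is only one of several drivers.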
Top 10 with very high GDP but small population
# Create a GDP per capita column
# The factor 1e9 assumes the GDP column is in billions of USD, as its name says;
# check the units of the scraped values before relying on this ratio
df_combined["GDP_per_Capita"] = df_combined["IMF_GDP_Billion_USD"] * 1e9 / df_combined["Population_2023_Count"]
# Filter Top 10 High GDP / Small Population (under 40 million people)
top_gdp_small_pop = (
    df_combined[df_combined["Population_2023_Count"] < 40_000_000]
    .sort_values("IMF_GDP_Billion_USD", ascending=False)
    .head(10)
)
# Filter the 10 lowest-GDP countries among those with over 100 million people
high_pop_low_gdp = (
    df_combined[df_combined["Population_2023_Count"] > 100_000_000]
    .sort_values("IMF_GDP_Billion_USD", ascending=True)
    .head(10)
)
# Bar chart
plt.figure(figsize=(10, 6))
sns.barplot(
    data=top_gdp_small_pop,
    x="IMF_GDP_Billion_USD",
    y="Country",
    hue='Country',
    palette="Greens_r"
)
plt.title("Top 10: High GDP, Small Population")
plt.xlabel("GDP (Billion USD)")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
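The per-capita ratio is only meaningful when the GDP units are known. As a sketch, assuming the IMF figures are expressed in millions of US dollars (the scale suggested by the raw values shown in Step 1, where the United States is listed at 30507217), the conversion would use a factor of 1e6. The three rows are copied from the raw tables earlier in the notebook; the column names are illustrative.

```python
import pandas as pd

# Rows copied from the raw tables; GDP assumed to be in millions of USD
sample = pd.DataFrame({
    "Country": ["United States", "India", "Japan"],
    "IMF_GDP_Million_USD": [30507217, 4187017, 4186431],
    "Population_2023": [343477335, 1438069596, 124370947],
})

USD_PER_MILLION = 1e6  # assumption: source values are in millions of USD

sample["GDP_per_Capita_USD"] = (
    sample["IMF_GDP_Million_USD"] * USD_PER_MILLION / sample["Population_2023"]
)

# Rank by per-capita output rather than by total GDP
ranked = sample.sort_values("GDP_per_Capita_USD", ascending=False)
print(ranked[["Country", "GDP_per_Capita_USD"]].round(0))
```

Sorting by the per-capita column instead of total GDP is what brings small, wealthy economies to the top of a ranking.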
Top 10 with high population but relatively low GDP
plt.figure(figsize=(10, 6))
sns.barplot(
    data=high_pop_low_gdp,
    x="IMF_GDP_Billion_USD",
    y="Country",
    hue='Country',
    palette="Reds"
)
plt.title("Top 10: High Population, Low GDP")
plt.xlabel("GDP (Billion USD)")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
Key Findings
- Population size does not always equate to GDP strength
- Certain small countries rank near the top on GDP relative to their population (e.g., Luxembourg)
- Africa and Asia house large populations but varying GDP levels
- Americas and Europe show more even distribution between GDP and population
Conclusion
This analysis shows how web scraping, data wrangling, and visualization can uncover socioeconomic insights from publicly available datasets. It emphasizes the importance of clean data and contextual visualization.